Poisson Regression

Statistics for Data Science II

Introduction

  • Suppose we are faced with count data.

    • This is discrete data, not continuous.
  • Fortunately, the Poisson distribution is appropriate for count data.

  • The Poisson regression model is as follows:

\ln\left( y \right) = \beta_0 + \beta_1 x_1 + ... + \beta_k x_k

  • Note that this is similar to logistic regression in that we are using the natural log when modeling the outcome.

Today’s Data

  • candidatevotes: votes received by this candidate for this particular party

  • totalvotes: total number of votes cast for this election

  • year: year in which election was held

  • region: region that state belongs to, as defined by the US Census Bureau

    • I derived this; please examine the code to see how.
library(tidyverse)
library(fastDummies)

house <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-11-07/house.csv') %>%
  filter(stage == "GEN" &
         party %in% c("REPUBLICAN", "DEMOCRAT") &
         state != "DISTRICT OF COLUMBIA") %>%
  mutate(candidatevotes = candidatevotes + 1,
         year10 = year/10,
         region = if_else(state %in% c("CONNECTICUT", "MAINE", "MASSACHUSETTS", "NEW HAMPSHIRE", "RHODE ISLAND", "VERMONT", "NEW JERSEY", "NEW YORK", "PENNSYLVANIA"), "Northeast",
                          if_else(state %in% c("ILLINOIS", "INDIANA", "MICHIGAN", "OHIO", "WISCONSIN", "IOWA", "KANSAS", "MINNESOTA", "MISSOURI", "NEBRASKA", "NORTH DAKOTA", "SOUTH DAKOTA"), "Midwest",
                                  if_else(state %in% c("DELAWARE", "FLORIDA", "GEORGIA", "MARYLAND", "NORTH CAROLINA", "SOUTH CAROLINA", "VIRGINIA","WEST VIRGINIA", "ALABAMA", "KENTUCKY", "MISSISSIPPI", "TENNESSEE", "ARKANSAS", "LOUISIANA", "OKLAHOMA", "TEXAS"), "South", "West")))) %>%
  dummy_cols(select_columns = c("region"))

Data Exploration

house %>% ## BOX PLOT 1 ##
  ggplot(aes(y=candidatevotes, x=region)) + 
  geom_boxplot() +
  theme_bw()
house %>% ## BOX PLOT 2 ##
  filter(candidatevotes < 450000) %>%
  ggplot(aes(y=candidatevotes, x=region)) + 
  geom_boxplot() +
  theme_bw()
house %>% ## HISTOGRAM 1 ##
  filter(candidatevotes < 450000) %>%
  ggplot(aes(x=candidatevotes)) + 
  geom_histogram(bins = 45) +
  theme_bw()
house %>% ## HISTOGRAM 2 ##
  filter(candidatevotes < 450000) %>%
  ggplot(aes(x=candidatevotes)) + 
  geom_histogram(bins = 45) +
  theme_bw()

R Syntax

  • We are able to again use the glm() function when specifying the Poisson distribution.
m <- glm(outcome ~ var_1 + var_2 + ... + var_k, 
         family="poisson",
         data=dataset)
  • Everything we have learned about how to use model results applies here as well.

Example

  • Let’s use the election data to model the number of votes received as a function of the year, the region, the interaction between the party and the year, and the interaction between the party and region.
m <- glm(candidatevotes ~ year10 +  party + region_Midwest + region_Northeast + region_West + 
           party:year10 + 
           party:region_Midwest + party:region_Northeast + party:region_West + totalvotes,
         family = "poisson",
         data = house)
summary(m)
  • Why did I also include totalvotes in the model?

Call:
glm(formula = candidatevotes ~ year10 + party + region_Midwest + 
    region_Northeast + region_West + party:year10 + party:region_Midwest + 
    party:region_Northeast + party:region_West + totalvotes, 
    family = "poisson", data = house)

Coefficients:
                                   Estimate Std. Error z value Pr(>|z|)    
(Intercept)                      -2.619e+00  4.553e-03  -575.3   <2e-16 ***
year10                            6.886e-02  2.277e-05  3024.3   <2e-16 ***
partyREPUBLICAN                  -4.398e+00  6.365e-03  -691.0   <2e-16 ***
region_Midwest                    1.111e-01  8.373e-05  1327.2   <2e-16 ***
region_Northeast                  1.204e-01  8.696e-05  1384.0   <2e-16 ***
region_West                       1.122e-01  8.471e-05  1324.6   <2e-16 ***
totalvotes                        1.587e-06  1.200e-10 13230.0   <2e-16 ***
year10:partyREPUBLICAN            2.239e-02  3.177e-05   704.8   <2e-16 ***
partyREPUBLICAN:region_Midwest   -5.855e-02  1.161e-04  -504.2   <2e-16 ***
partyREPUBLICAN:region_Northeast -3.096e-01  1.273e-04 -2431.8   <2e-16 ***
partyREPUBLICAN:region_West      -1.865e-01  1.213e-04 -1536.9   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 553349377  on 19566  degrees of freedom
Residual deviance: 382263499  on 19556  degrees of freedom
AIC: 382521177

Number of Fisher Scoring iterations: 5
coefficients(m)
                     (Intercept)                           year10 
                   -2.619027e+00                     6.885689e-02 
                 partyREPUBLICAN                   region_Midwest 
                   -4.398204e+00                     1.111224e-01 
                region_Northeast                      region_West 
                    1.203553e-01                     1.122048e-01 
                      totalvotes           year10:partyREPUBLICAN 
                    1.587455e-06                     2.239381e-02 
  partyREPUBLICAN:region_Midwest partyREPUBLICAN:region_Northeast 
                   -5.855104e-02                    -3.096250e-01 
     partyREPUBLICAN:region_West 
                   -1.864849e-01 

\begin{align*} \ln(\hat{y}) =& -2.62 + 0.07 \text{ year} - 4.40 \text{ repub.} + 0.11 \text{ midwest} + 0.12 \text{ northeast} + 0.11 \text{ west} \\ & + 0.02 (\text{repub. $\times$ year}) \\ & - 0.06 (\text{repub. $\times$ midwest}) - 0.31 (\text{repub. $\times$ northeast}) - 0.19 (\text{repub. $\times$ west}) \end{align*}

Interpretations

  • In Poisson regression, we convert the \hat{\beta}_i values to incident rate ratios (IRR).

\text{IRR}_i = \exp\left\{\hat{\beta}_i\right\}

  • This is a multiplicative effect, like an odds ratio in logistic regression.

    • An IRR > 1 indicates an increase in the expected count.

    • An IRR < 1 indicates a decrease in the expected count.

  • We also interpret the IRR similar to the odds ratio:

    • For a 1 [unit of predictor] increase in [predictor name], the expected count of [outcome] is multiplied by \left[ e^{\hat{\beta}_i} \right].

    • For a 1 [unit of predictor] increase in [predictor name], the expected count of [outcome] are [increased or decreased] by [100(e^{\hat{\beta}_i}-1)% or 100(1-e^{\hat{\beta}_i})%].

Example

  • Let’s provide interpretations when year10 = 198.3 (i.e., 1983).
#coef(m)
exp(0.06885689 + 0.02239381*1) # year10
[1] 1.095544
exp(0.1111224 - 0.05855104*1) # midwest
[1] 1.053978
exp(0.1203553 - 0.3096250*1) # northeast
[1] 0.8275633
exp(0.1122048 - 0.1864849*1) # west
[1] 0.9284116
  • Every year, the expected count of votes increases by 10%.

  • Those in the midwest have an expected number of votes that is 1.05 times the votes for those in the south.

    • This is a 5% increase.
  • Those in the northeast have an expected number of votes that is 0.83 times the votes for those in the south.

    • This is a 17% decrease.
  • Those in the west have an expected number of votes that is 0.93 times the votes for those in the south.

    • This is a 7% decrease.
#coef(m)
exp(0.06885689 + 0.02239381*0) # year10
[1] 1.071283
exp(0.1111224 - 0.05855104*0) # midwest
[1] 1.117532
exp(0.1203553 - 0.3096250*0) # northeast
[1] 1.127898
exp(0.1122048 - 0.1864849*0) # west
[1] 1.118742
  • Every year, the expected count of votes increases by 7%.

  • Those in the midwest have an expected number of votes that is 1.12 times the votes for those in the south.

    • This is a 12% increase.
  • Those in the northeast have an expected number of votes that is 1.13 times the votes for those in the south.

    • This is a 13% increase.
  • Those in the west have an expected number of votes that is 1.12 times the votes for those in the south.

    • This is a 12% increase.

Inference - Confidence Intervals

  • We again find confidence intervals as we did before, using the confint() function.
confint(m)
                                         2.5 %        97.5 %
(Intercept)                      -2.627950e+00 -2.610104e+00
year10                            6.881227e-02  6.890152e-02
partyREPUBLICAN                  -4.410679e+00 -4.385728e+00
region_Midwest                    1.109583e-01  1.112865e-01
region_Northeast                  1.201849e-01  1.205258e-01
region_West                       1.120387e-01  1.123708e-01
totalvotes                        1.587220e-06  1.587690e-06
year10:partyREPUBLICAN            2.233154e-02  2.245609e-02
partyREPUBLICAN:region_Midwest   -5.877865e-02 -5.832343e-02
partyREPUBLICAN:region_Northeast -3.098746e-01 -3.093755e-01
partyREPUBLICAN:region_West      -1.867227e-01 -1.862471e-01

Inference - Hypothesis Testing

  • We take the same approach to determining significance as before.

  • A single term in the model \to summary().

  • Multiple terms in the model \to car::Anova().

    • This is required for a categorical variable with three or more categories.

Example

  • Is the interaction between party and year significant?
summary(m)

Call:
glm(formula = candidatevotes ~ year10 + party + region_Midwest + 
    region_Northeast + region_West + party:year10 + party:region_Midwest + 
    party:region_Northeast + party:region_West + totalvotes, 
    family = "poisson", data = house)

Coefficients:
                                   Estimate Std. Error z value Pr(>|z|)    
(Intercept)                      -2.619e+00  4.553e-03  -575.3   <2e-16 ***
year10                            6.886e-02  2.277e-05  3024.3   <2e-16 ***
partyREPUBLICAN                  -4.398e+00  6.365e-03  -691.0   <2e-16 ***
region_Midwest                    1.111e-01  8.373e-05  1327.2   <2e-16 ***
region_Northeast                  1.204e-01  8.696e-05  1384.0   <2e-16 ***
region_West                       1.122e-01  8.471e-05  1324.6   <2e-16 ***
totalvotes                        1.587e-06  1.200e-10 13230.0   <2e-16 ***
year10:partyREPUBLICAN            2.239e-02  3.177e-05   704.8   <2e-16 ***
partyREPUBLICAN:region_Midwest   -5.855e-02  1.161e-04  -504.2   <2e-16 ***
partyREPUBLICAN:region_Northeast -3.096e-01  1.273e-04 -2431.8   <2e-16 ***
partyREPUBLICAN:region_West      -1.865e-01  1.213e-04 -1536.9   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 553349377  on 19566  degrees of freedom
Residual deviance: 382263499  on 19556  degrees of freedom
AIC: 382521177

Number of Fisher Scoring iterations: 5
  • Yes! p < 0.001.

Example

  • Is the interaction between party and region significant?
  • Yes! p < 0.001.
m2 <- glm(candidatevotes ~ year10 +  party + region + party:year10 + party:region + totalvotes, family = "poisson", data = house)
car::Anova(m2, type = 3)

Data Visualization

  • Let’s create a couple of data visualizations.

    • Hold year constant at 1983.

    • Let total number of votes vary on the x-axis.

    • Define lines for each region…

      • and graphs for each politcal party.

Data Visualization

# coefficients(m) # get coeff

# create full model before plugging in
#-2.619027 + 0.06885689*year10 -4.398204*repub + 0.1111224*midwest + 0.1203553*northeast + 0.1122048*west + 0.000001587455*totalvotes + 0.02239381*year10*repub - 0.05855104*midwest*repub - 0.3096250*northeast*repub - 0.1864849*west*repub

# year10 = 1983
#-2.619027 + 0.06885689*198.3 -4.398204*repub + 0.1111224*midwest + 0.1203553*northeast + 0.1122048*west + 0.000001587455*totalvotes + 0.02239381*198.3*repub - 0.05855104*midwest*repub - 0.3096250*northeast*repub - 0.1864849*west*repub

# republicans
#-2.619027 + 0.06885689*198.3 -4.398204*1 + 0.1111224*midwest + 0.1203553*northeast + 0.1122048*west + 0.000001587455*totalvotes + 0.02239381*198.3*1 - 0.05855104*midwest*1 - 0.3096250*northeast*1 - 0.1864849*west*1

# democrats
#-2.619027 + 0.06885689*198.3 -4.398204*0 + 0.1111224*midwest + 0.1203553*northeast + 0.1122048*west + 0.000001587455*totalvotes + 0.02239381*198.3*0 - 0.05855104*midwest*0 - 0.3096250*northeast*0 - 0.1864849*west*0
house <- house %>%
  mutate(r_mw = exp(-2.619027 + 0.06885689*198.3 -4.398204*1 + 0.1111224*1 + 0.1203553*0 + 0.1122048*0 + 0.000001587455*totalvotes + 0.02239381*198.3*1 - 0.05855104*1*1 - 0.3096250*0*1 - 0.1864849*0*1),
         r_ne = exp(-2.619027 + 0.06885689*198.3 -4.398204*1 + 0.1111224*0 + 0.1203553*1 + 0.1122048*0 + 0.000001587455*totalvotes + 0.02239381*198.3*1 - 0.05855104*0*1 - 0.3096250*1*1 - 0.1864849*0*1),
         r_w = exp(-2.619027 + 0.06885689*198.3 -4.398204*1 + 0.1111224*0 + 0.1203553*0 + 0.1122048*1 + 0.000001587455*totalvotes + 0.02239381*198.3*1 - 0.05855104*0*1 - 0.3096250*0*1 - 0.1864849*1*1),
         r_s = exp(-2.619027 + 0.06885689*198.3 -4.398204*1 + 0.1111224*0 + 0.1203553*0 + 0.1122048*0 + 0.000001587455*totalvotes + 0.02239381*198.3*1 - 0.05855104*0*1 - 0.3096250*0*1 - 0.1864849*0*1),
         d_mw = exp(-2.619027 + 0.06885689*198.3 -4.398204*0 + 0.1111224*1 + 0.1203553*0 + 0.1122048*0 + 0.000001587455*totalvotes + 0.02239381*198.3*0 - 0.05855104*1*0 - 0.3096250*0*0 - 0.1864849*0*0),
         d_ne = exp(-2.619027 + 0.06885689*198.3 -4.398204*0 + 0.1111224*0 + 0.1203553*1 + 0.1122048*0 + 0.000001587455*totalvotes + 0.02239381*198.3*0 - 0.05855104*0*0 - 0.3096250*1*0 - 0.1864849*0*0),
         d_w = exp(-2.619027 + 0.06885689*198.3 -4.398204*0 + 0.1111224*0 + 0.1203553*0 + 0.1122048*1 + 0.000001587455*totalvotes + 0.02239381*198.3*0 - 0.05855104*0*0 - 0.3096250*0*0 - 0.1864849*1*0),
         d_s = exp(-2.619027 + 0.06885689*198.3 -4.398204*0 + 0.1111224*0 + 0.1203553*0 + 0.1122048*0 + 0.000001587455*totalvotes + 0.02239381*198.3*0 - 0.05855104*0*0 - 0.3096250*0*0 - 0.1864849*0*0))

# republicans
house %>% 
  ggplot(aes(x = totalvotes, y = candidatevotes)) +
  geom_point() + 
  geom_line(aes(y = r_ne), color = "pink") + 
  geom_line(aes(y = r_s), color = "hotpink") + 
  geom_line(aes(y = r_mw), color = "purple") + 
  geom_line(aes(y = r_w), color = "lightblue") +
  theme_bw()

# democrats
house %>% 
  ggplot(aes(x = totalvotes, y = candidatevotes)) +
  geom_point() + 
  geom_line(aes(y = d_ne), color = "pink") + 
  geom_line(aes(y = d_s), color = "hotpink") + 
  geom_line(aes(y = d_mw), color = "purple") + 
  geom_line(aes(y = d_w), color = "lightblue") +
  theme_bw()